About the project

The course focuses on important statistical techniques used in research. It is hands-on: I get to learn how to use tools such as R and GitHub. The main theme is open science, the new paradigm being promoted in research today.

This is the link to my repository


Regression and model validation

Describe the work you have done this week and summarize your learning.

This week focuses on data wrangling, regression analysis and model validation. In a survey, students were asked questions to understand the approaches they use in learning. Based on these, the following variables were extracted: gender, age, examination points, strategic approach, deep learning approach, attitude and surface learning approach.

The results of my analysis showed that there were about twice as many female respondents as male. Overall, attitude seems to be the most important factor determining students’ success in their examination. The strategic learning approach also seems to contribute to some extent. While age showed a very small negative effect, it cannot be concluded to be a major factor, as most of the respondents were young students below 26.

To validate my model, I used the Q-Q plot, the residuals vs fitted plot and the residuals vs leverage plot. In choosing my predictor variables, I examined the standard error, the Multiple R-squared (coefficient of determination) and the p-value to check the significance of the variables. The ggpairs() function also came in handy for getting a broad overview of the distributions of the variables.

#read the learning2014 data

lrn14 <- read.table("C:/Users/oyeda/Desktop/OPEN_DATA_SCIENCE/IODS-project/data/learning2014.txt", header = TRUE, sep = "\t")

#Alternatively, I could read the already wrangled data from the web as below:
#lrn14 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", header = TRUE, sep = ",")


#structure of the data
str(lrn14)
## 'data.frame':    166 obs. of  7 variables:
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 4.75 3.83 3.25 4.33 4 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 2.42 1.92 2.83 2.17 3 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 3.62 2.25 4 4.25 3.5 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
#dimension of the data
dim(lrn14)
## [1] 166   7

The data includes 7 variables which describe several attributes of the students.

#load packages
library(ggplot2)
#install.packages("GGally")
library(GGally)
#create a pair plot to see the broad overview of the data
ggpairs(lrn14, aes(col=gender), lower = list(combo = wrap("facethist", bins = 20)))

#summary of the data
summary(lrn14)
##       deep           surf            stra            age       
##  Min.   :1.58   Min.   :1.580   Min.   :1.250   Min.   :17.00  
##  1st Qu.:3.33   1st Qu.:2.420   1st Qu.:2.620   1st Qu.:21.00  
##  Median :3.67   Median :2.830   Median :3.185   Median :22.00  
##  Mean   :3.68   Mean   :2.787   Mean   :3.121   Mean   :25.51  
##  3rd Qu.:4.08   3rd Qu.:3.170   3rd Qu.:3.620   3rd Qu.:27.00  
##  Max.   :4.92   Max.   :4.330   Max.   :5.000   Max.   :55.00  
##     attitude         points      gender 
##  Min.   :1.400   Min.   : 7.00   F:110  
##  1st Qu.:2.600   1st Qu.:19.00   M: 56  
##  Median :3.200   Median :23.00          
##  Mean   :3.143   Mean   :22.72          
##  3rd Qu.:3.700   3rd Qu.:27.75          
##  Max.   :5.000   Max.   :33.00

By and large, there are almost twice as many female respondents as male among the 166 students (respondents with an examination score of 0 were excluded). The average age is about 26 years, with the youngest student being 17 and the oldest 55. The age distribution is skewed to the right, showing that the respondents are mostly young students below 26. The box plot also shows that age is non-normal and right-skewed, with some outliers.

The attitude, points and deep learning approach also have a few outliers. Generally, the pairs show weak correlations. However, attitude seems to be the most strongly correlated with points (although the correlation is still weak), for both male and female students. Men generally seem to have a slightly better attitude than their female counterparts, although some male students show a very poor attitude towards learning.

Both genders appear to range from an average to a very deep learning approach. Female students have a slightly better strategic approach. However, there appears to be little distinction between the examination points of male and female students. Most students scored above average, while a few showed very poor performance.

trying out the model using all the variables as predictors

l_model<-lm(points~gender+age+attitude+deep+stra+surf, data = lrn14)
summary(l_model)
## 
## Call:
## lm(formula = points ~ gender + age + attitude + deep + stra + 
##     surf, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.4513  -3.2513   0.3836   3.5240  11.4869 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.32352    5.25571   3.486 0.000633 ***
## genderM     -0.06009    0.92983  -0.065 0.948551    
## age         -0.09606    0.05396  -1.780 0.076916 .  
## attitude     3.44426    0.59775   5.762 4.19e-08 ***
## deep        -1.04826    0.78408  -1.337 0.183155    
## stra         0.95823    0.55156   1.737 0.084271 .  
## surf        -1.11001    0.84433  -1.315 0.190519    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.265 on 159 degrees of freedom
## Multiple R-squared:  0.2312, Adjusted R-squared:  0.2022 
## F-statistic: 7.971 on 6 and 159 DF,  p-value: 1.588e-07

I first used all the variables in the model, then applied stepwise regression to help remove the redundant variables.

library(MASS)   #load package
stepw_reg<-stepAIC(l_model, direction = "both")
## Start:  AIC=558.34
## points ~ gender + age + attitude + deep + stra + surf
## 
##            Df Sum of Sq    RSS    AIC
## - gender    1      0.12 4408.0 556.35
## - surf      1     47.91 4455.8 558.14
## - deep      1     49.55 4457.5 558.20
## <none>                  4407.9 558.34
## - stra      1     83.67 4491.6 559.46
## - age       1     87.88 4495.8 559.62
## - attitude  1    920.43 5328.3 587.82
## 
## Step:  AIC=556.35
## points ~ age + attitude + deep + stra + surf
## 
##            Df Sum of Sq    RSS    AIC
## - surf      1     47.82 4455.8 556.14
## - deep      1     49.66 4457.7 556.21
## <none>                  4408.0 556.35
## - stra      1     88.28 4496.3 557.64
## - age       1     90.18 4498.2 557.71
## + gender    1      0.12 4407.9 558.34
## - attitude  1    999.58 5407.6 588.27
## 
## Step:  AIC=556.14
## points ~ age + attitude + deep + stra
## 
##            Df Sum of Sq    RSS    AIC
## - deep      1     27.03 4482.9 555.14
## <none>                  4455.8 556.14
## + surf      1     47.82 4408.0 556.35
## - age       1     75.38 4531.2 556.92
## - stra      1    106.09 4561.9 558.04
## + gender    1      0.02 4455.8 558.14
## - attitude  1   1085.20 5541.0 590.32
## 
## Step:  AIC=555.14
## points ~ age + attitude + stra
## 
##            Df Sum of Sq    RSS    AIC
## <none>                  4482.9 555.14
## - age       1     76.60 4559.5 555.95
## + deep      1     27.03 4455.8 556.14
## + surf      1     25.19 4457.7 556.21
## - stra      1     97.54 4580.4 556.71
## + gender    1      0.01 4482.9 557.14
## - attitude  1   1061.20 5544.1 588.41

After using stepwise regression, I ended up with three explanatory variables, namely ‘age’, ‘attitude’ and ‘stra’.

Now, let’s refit the model with these three variables.

l_model2<-lm(points ~ age + attitude + stra, data = lrn14)
summary(l_model2)
## 
## Call:
## lm(formula = points ~ age + attitude + stra, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1199  -3.2006   0.3307   3.4131  10.7605 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.89473    2.64895   4.113 6.20e-05 ***
## age         -0.08821    0.05302  -1.664   0.0981 .  
## attitude     3.48143    0.56219   6.193 4.69e-09 ***
## stra         1.00324    0.53436   1.877   0.0623 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 1.072e-08

As shown by the model, the p-value of ‘attitude’ is well below 0.05, so we can say with high confidence that ‘attitude’ significantly affects the examination points. However, the coefficient of determination (Multiple R-squared) of the model is only 0.2182 (21.82%), which is relatively low. The standard error of age is quite low, but age appears insignificant. Just as points and age showed a low negative correlation earlier, age has a slight negative effect (-0.08821) on the examination points, while attitude, which showed a somewhat stronger positive correlation with points, has a positive effect (3.48143) on the examination points.

I will remove age and the strategic approach because they are insignificant, with p-values higher than 0.05.

Finally, I am left with a single predictor, ‘attitude’, which seems to have a positive effect on the exam points.

#Final model
lm_final<-lm(points~attitude, data = lrn14)
summary(lm_final)
## 
## Call:
## lm(formula = points ~ attitude, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Although the coefficient of determination (Multiple R-squared) dropped slightly from 21.82% to 19.06%, the residual standard error only increased from 5.26 to 5.32, so the difference is not considerable. The p-value is also well below 0.05.

Thus, I decided to use only ‘attitude’. This is in accordance with the principle of parsimony, whereby it is advisable to use as few predictors as possible when building models.
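To make the final model concrete, the fitted line can be applied by hand. Below is a minimal sketch using the coefficients reported in summary(lm_final) above; the helper name predict_points is my own.

```r
# Predicted exam points from the final model points ~ attitude,
# using the coefficients reported in summary(lm_final):
# intercept 11.6372, slope 3.5255
predict_points <- function(attitude) {
  11.6372 + 3.5255 * attitude
}

predict_points(3)          # a student with a middling attitude: about 22.2 points
predict_points(c(1.4, 5))  # the range of attitude observed in the data
```

In practice, predict(lm_final, newdata = data.frame(attitude = 3)) gives the same number directly from the fitted object.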

par(mfrow=c(2,2))
plot(lm_final, which = c(1,2,5))

As seen in the Normal Q-Q plot, the points fit the line well, aside from a few deviations at both ends. The errors can thus be said to be approximately normally distributed.

In the residuals vs fitted plot, the spread of the residuals changes only slightly as the fitted values increase, so the constant-variance assumption seems reasonable. In the residuals vs leverage plot, no observation seems to have clear leverage over the others.


Logistic regression

The data describes students’ achievement in secondary education at two Portuguese schools. The data attributes include student grades, demographic, social and school-related features, and it was collected using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modelled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final-year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see the paper source for more details). This is the data source

Data Set Information:

The data was obtained from this source

# access the tidyverse libraries tidyr, dplyr, ggplot2
#install.packages("tidyr")
library(tidyr); library(dplyr); library(ggplot2); library(GGally);

#read the data
alc<- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep = ",", header = T)

DATA OVERVIEW

colnames(alc)   #names of columns
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
summary(alc)    #summary
##  school   sex          age        address famsize   Pstatus
##  GP:342   F:198   Min.   :15.00   R: 81   GT3:278   A: 38  
##  MS: 40   M:184   1st Qu.:16.00   U:301   LE3:104   T:344  
##                   Median :17.00                            
##                   Mean   :16.59                            
##                   3rd Qu.:17.00                            
##                   Max.   :22.00                            
##       Medu            Fedu             Mjob           Fjob    
##  Min.   :0.000   Min.   :0.000   at_home : 53   at_home : 16  
##  1st Qu.:2.000   1st Qu.:2.000   health  : 33   health  : 17  
##  Median :3.000   Median :3.000   other   :138   other   :211  
##  Mean   :2.806   Mean   :2.565   services: 96   services:107  
##  3rd Qu.:4.000   3rd Qu.:4.000   teacher : 62   teacher : 31  
##  Max.   :4.000   Max.   :4.000                                
##         reason    nursery   internet    guardian     traveltime   
##  course    :140   no : 72   no : 58   father: 91   Min.   :1.000  
##  home      :110   yes:310   yes:324   mother:275   1st Qu.:1.000  
##  other     : 34                       other : 16   Median :1.000  
##  reputation: 98                                    Mean   :1.442  
##                                                    3rd Qu.:2.000  
##                                                    Max.   :4.000  
##    studytime        failures      schoolsup famsup     paid     activities
##  Min.   :1.000   Min.   :0.0000   no :331   no :144   no :205   no :181   
##  1st Qu.:1.000   1st Qu.:0.0000   yes: 51   yes:238   yes:177   yes:201   
##  Median :2.000   Median :0.0000                                           
##  Mean   :2.034   Mean   :0.2906                                           
##  3rd Qu.:2.000   3rd Qu.:0.0000                                           
##  Max.   :4.000   Max.   :3.0000                                           
##  higher    romantic      famrel        freetime         goout      
##  no : 18   no :261   Min.   :1.00   Min.   :1.000   Min.   :1.000  
##  yes:364   yes:121   1st Qu.:4.00   1st Qu.:3.000   1st Qu.:2.000  
##                      Median :4.00   Median :3.000   Median :3.000  
##                      Mean   :3.94   Mean   :3.223   Mean   :3.113  
##                      3rd Qu.:5.00   3rd Qu.:4.000   3rd Qu.:4.000  
##                      Max.   :5.00   Max.   :5.000   Max.   :5.000  
##       Dalc            Walc          health         absences     
##  Min.   :1.000   Min.   :1.00   Min.   :1.000   Min.   : 0.000  
##  1st Qu.:1.000   1st Qu.:1.00   1st Qu.:3.000   1st Qu.: 0.000  
##  Median :1.000   Median :2.00   Median :4.000   Median : 3.000  
##  Mean   :1.474   Mean   :2.28   Mean   :3.579   Mean   : 5.319  
##  3rd Qu.:2.000   3rd Qu.:3.00   3rd Qu.:5.000   3rd Qu.: 8.000  
##  Max.   :5.000   Max.   :5.00   Max.   :5.000   Max.   :75.000  
##        G1              G2              G3           alc_use     
##  Min.   : 3.00   Min.   : 0.00   Min.   : 0.00   Min.   :1.000  
##  1st Qu.: 8.00   1st Qu.: 8.25   1st Qu.: 8.00   1st Qu.:1.000  
##  Median :10.50   Median :11.00   Median :11.00   Median :1.500  
##  Mean   :10.86   Mean   :10.71   Mean   :10.39   Mean   :1.877  
##  3rd Qu.:13.00   3rd Qu.:13.00   3rd Qu.:14.00   3rd Qu.:2.500  
##  Max.   :19.00   Max.   :19.00   Max.   :20.00   Max.   :5.000  
##   high_use      
##  Mode :logical  
##  FALSE:270      
##  TRUE :112      
##                 
##                 
## 
dim(alc)    #dimension
## [1] 382  35
str(alc)   #structure
## 'data.frame':    382 obs. of  35 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
attach(alc)   #attach all the variables in the dataframe
# define a new logical column 'high_use'
alc$high_use<- alc$alc_use > 2
#alc <- mutate(alc, high_use = alc_use > 2) #alternative

Hypotheses.

  • age is not related to alcohol consumption
  • sex is not related to alcohol consumption
  • absence from school is not related to alcohol use
  • freetime after school is not related to alcohol consumption

subsetting and exploring my chosen variables

hyp<- alc[,c("age", "sex",  "absences","freetime","alc_use")]
#exploring my chosen variables
# draw a bar plot of each variable
hyp_plot <- gather(hyp) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")
hyp_plot + geom_bar()

The above shows the distributions of the chosen predictors and of the alcohol consumption scale. The ages of the students are mainly within 15-19, with a few students aged 20 and 22. Generally, there are relatively few heavy alcohol consumers amongst the students. The students seem to have a fair amount of free time. The respondents include 198 female and 184 male students.

Overview of the chosen predictors and alcohol consumption variable.

hyp<- alc[,c("age", "sex",  "absences","freetime","alc_use")]
#show the overview of the predictors and response variable(non-binomial)
plot_hyp <- ggpairs(hyp, mapping = aes(col=sex, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
plot_hyp   #draw the plot

Here, we can see that the female students seem to be slightly older in general, and there are both male and female students who are much older than the bulk of the sample. Absences are not very high overall, but there are some chronic absentees amongst the students (both male and female). Most of the students have a fair amount of free time, though a few are very busy. Largely, the predictors show no correlation with each other, so multicollinearity is not a problem here. However, the predictors also show only very weak correlations with alcohol consumption. Nevertheless, it should be recalled that correlation is not causation. Hence, I will dig further to understand the effects of these predictors on alcohol consumption amongst students and see if they are significant.

Crosstabs

Below are crosstabs that compare the predictors with alcohol consumption.

#using the binomial response(high_use)
#it is also possible to use simple crosstabs e.g
#xtabs(high_use~age, data = alc)
#but below is a more comprehensive crosstab.
#Crosstabs
summarise(group_by(alc, age,high_use), count=n(), mean_grade=mean(G3))
## # A tibble: 12 x 4
## # Groups:   age [?]
##      age high_use count mean_grade
##    <int>    <lgl> <int>      <dbl>
##  1    15    FALSE    63  11.380952
##  2    15     TRUE    18  11.055556
##  3    16    FALSE    79  11.303797
##  4    16     TRUE    28  10.714286
##  5    17    FALSE    65  10.123077
##  6    17     TRUE    35  10.228571
##  7    18    FALSE    55   9.218182
##  8    18     TRUE    26   9.423077
##  9    19    FALSE     7   5.714286
## 10    19     TRUE     4   6.250000
## 11    20    FALSE     1  18.000000
## 12    22     TRUE     1   8.000000
#an alternative and preferable way of doing the above
#alc %>% group_by(age, high_use) %>% summarise(count=n(), mean_grade =  mean(G3))

summarise(group_by(alc, sex,high_use), count=n(), mean_grade=mean(G3))
## # A tibble: 4 x 4
## # Groups:   sex [?]
##      sex high_use count mean_grade
##   <fctr>    <lgl> <int>      <dbl>
## 1      F    FALSE   157   9.687898
## 2      F     TRUE    41  10.414634
## 3      M    FALSE   113  11.610619
## 4      M     TRUE    71   9.971831
summarise(group_by(alc, absences,high_use), count=n(), mean_grade=mean(G3))
## # A tibble: 48 x 4
## # Groups:   absences [?]
##    absences high_use count mean_grade
##       <int>    <lgl> <int>      <dbl>
##  1        0    FALSE    94   8.372340
##  2        0     TRUE    22   8.318182
##  3        1    FALSE     2  12.000000
##  4        1     TRUE     1  15.000000
##  5        2    FALSE    54  12.296296
##  6        2     TRUE    13  11.076923
##  7        3    FALSE     2  11.000000
##  8        3     TRUE     4  13.000000
##  9        4    FALSE    36  11.666667
## 10        4     TRUE    15  10.200000
## # ... with 38 more rows
summarise(group_by(alc, freetime,high_use), count=n(), mean_grade=mean(G3))
## # A tibble: 10 x 4
## # Groups:   freetime [?]
##    freetime high_use count mean_grade
##       <int>    <lgl> <int>      <dbl>
##  1        1    FALSE    16    9.12500
##  2        1     TRUE     2    8.50000
##  3        2    FALSE    47   11.87234
##  4        2     TRUE    15   11.60000
##  5        3    FALSE   116    9.75000
##  6        3     TRUE    40    9.82500
##  7        4    FALSE    69   10.46377
##  8        4     TRUE    40   10.10000
##  9        5    FALSE    22   12.54545
## 10        5     TRUE    15    9.80000

Bar charts

The age distribution by sex

# initialize a plot of 'age'
sex_age <- ggplot(data = alc, aes(x=age, col=sex))

# draw a bar plot of age by sex
sex_age + geom_bar() + facet_wrap("sex")

high alcohol consumption by sex

# initialize a plot of 'high_use'
hu_sex <- ggplot(data = alc, aes(x=sex, col=sex))

# draw a bar plot of high_use by sex
hu_sex + geom_bar() + facet_wrap("high_use")

Here, we can see that there are more female students who are not high consumers of alcohol compared to male students. Men consume more alcohol than their female counterparts.

High use vs freetime

# initialize a plot of 'high_use'
hu_ft <- ggplot(data = alc, aes(x=freetime, col=sex))

# draw a bar plot of 'high_use' by freetime
hu_ft + geom_bar() + facet_wrap("high_use")

Contrary to what one might expect, most students with more free time do not seem to be high consumers of alcohol.

High use vs absences

# initialize a plot of 'high_use'
hu_ab <- ggplot(data = alc, aes(x=absences, col=sex))

# draw a bar plot of 'high_use' by absences
hu_ab + geom_bar() + facet_wrap("high_use")

A few of the chronic absentees are high alcohol users, but many are not.

High use vs age

# initialize a plot of 'high_use'
hu_ag <- ggplot(data = alc, aes(x=age, col=sex))

# draw a bar plot of 'high_use' by age
hu_ag + geom_bar() + facet_wrap("high_use")

Some of the young students below 20 years are high consumers of alcohol, though most of them are not.

High use vs romantic

# initialize a plot of 'high_use'
hu_rom <- ggplot(data = alc, aes(x=romantic, col=sex))

# draw a bar plot of 'high_use' by romantic
hu_rom + geom_bar() + facet_wrap("high_use")

Among students in romantic relationships, relatively few are high consumers of alcohol. The same holds for those not in romantic relationships.

BOXPLOTS

Below are boxplots to give a clearer picture, as explained earlier.

Relationship of alcohol consumption with absences

# initialise a plot of high_use and absences
h_ab<- ggplot(alc, aes(x=high_use, y=absences,col=sex))

# define the plot as a boxplot and draw it
h_ab + geom_boxplot() + ggtitle("Student absences vs alcohol consumption")

There are a few chronic absentees amongst male and female students, but generally, most of the students have low absences.

Relationship of alcohol consumption with age

# initialise a plot of high_use and age
h_ag<- ggplot(alc, aes(x=high_use, y=age,col=sex))

# define the plot as a boxplot and draw it
h_ag + geom_boxplot() + ggtitle("Students' age vs alcohol consumption")

Relationship of alcohol consumption with freetime

# initialise a plot of high_use and freetime
h_fr<- ggplot(alc, aes(x=high_use, y=freetime,col=sex))

# define the plot as a boxplot and draw it
h_fr + geom_boxplot() + ggtitle("Student's freetime vs alcohol consumption")

Relationship of alcohol consumption with sex

# initialise a plot of high_use and sex
alc_sex<- ggplot(alc, aes(y=alc_use, x=sex,col=sex))

# define the plot as a boxplot and draw it
alc_sex + geom_boxplot() + ggtitle("Student's alcohol consumption by sex")

There are a few female students with quite high alcohol consumption, but consumption levels do not vary as much amongst the female students as they do amongst the male students.

Logistic regression

fitting the glm model

high_use_mod1<- glm(high_use~ age + sex + absences + freetime , data= alc, family = "binomial")
summary(high_use_mod1)
## 
## Call:
## glm(formula = high_use ~ age + sex + absences + freetime, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5188  -0.8214  -0.5957   1.0480   2.1082  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.35260    1.76384  -3.035 0.002408 ** 
## age          0.15609    0.10221   1.527 0.126698    
## sexM         0.89550    0.24773   3.615 0.000301 ***
## absences     0.07162    0.01802   3.974 7.08e-05 ***
## freetime     0.30118    0.12578   2.394 0.016644 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 417.53  on 377  degrees of freedom
## AIC: 427.53
## 
## Number of Fisher Scoring iterations: 4
# coef(high_use_mod1)
# coefficients(high_use_mod1)

The predictors “age” and “freetime” are not significant. Hence, they could be considered redundant variables with no considerable effect on alcohol consumption, compared with absences and sex, which are significant. I therefore fail to reject the null hypotheses that age and freetime are not related to alcohol consumption. On the other hand, I reject the null hypotheses that sex and absences are not related to alcohol consumption; being male and having absences both have positive effects on alcohol consumption.

Refitting the model without the redundant variables

high_use_mod2<- glm(high_use~  sex + absences, data= alc, family = "binomial")
summary(high_use_mod2)
## 
## Call:
## glm(formula = high_use ~ sex + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7368  -0.8501  -0.5838   1.0919   1.9899  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.83117    0.21956  -8.340  < 2e-16 ***
## sexM         0.99923    0.24179   4.133 3.59e-05 ***
## absences     0.07403    0.01811   4.089 4.34e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 425.79  on 379  degrees of freedom
## AIC: 431.79
## 
## Number of Fisher Scoring iterations: 4
#options("contrasts")

check the overall effect of the removed variables by performing a likelihood ratio test

anova(high_use_mod1, high_use_mod2, test="LRT")
## Analysis of Deviance Table
## 
## Model 1: high_use ~ age + sex + absences + freetime
## Model 2: high_use ~ sex + absences
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1       377     417.53                       
## 2       379     425.79 -2  -8.2563  0.01611 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The likelihood ratio test gives p = 0.016, so at the 0.05 level the larger model actually fits somewhat better.
However, since ‘age’ and ‘freetime’ were individually insignificant, I keep the simpler model for parsimony: ‘sex’ and ‘absences’ carry most of the explanatory power on their own.

let’s also see whether “sex” is significant in the prediction

high_use_mod3<- glm(high_use~  absences, data= alc, family = "binomial")
anova(high_use_mod2, high_use_mod3, test="LRT")
## Analysis of Deviance Table
## 
## Model 1: high_use ~ sex + absences
## Model 2: high_use ~ absences
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
## 1       379     425.79                         
## 2       380     443.69 -1  -17.907 2.32e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The test shows that sex is important and would remain in the model.

Hence, final model is: high_use~ sex + absences

calculating the odds ratios and confidence intervals

# compute odds ratios (odds_ra)
odds_ra <- exp(coef(high_use_mod2))
#odds_ra <- coef(high_use_mod2) %>% exp     #alternative

# compute confidence intervals (conf_int)
conf_int <- exp(confint(high_use_mod2)) 
#Conf_Int <- high_use_mod2 %>%  confint() %>% exp   #alternative

# print out the odds ratios with their confidence intervals
cbind(odds_ra, conf_int)
##               odds_ra     2.5 %    97.5 %
## (Intercept) 0.1602264 0.1022869 0.2424271
## sexM        2.7161808 1.7015022 4.3985356
## absences    1.0768398 1.0414538 1.1176941

Here, we can see that being male and having absences are both positively associated with high alcohol use (odds ratios above 1). The confidence interval of male sex is quite wide compared to that of “absences”, which is narrower, so the effect of absences is estimated more precisely.
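Since logistic regression works on the log-odds scale, the coefficients above can be turned into a predicted probability for a concrete student. Below is a small sketch using the estimates reported in summary(high_use_mod2) (intercept -1.83117, sexM 0.99923, absences 0.07403); the object names are my own.

```r
# Log-odds of high alcohol use for a male student with 10 absences,
# using the coefficient estimates from summary(high_use_mod2)
log_odds <- -1.83117 + 0.99923 * 1 + 0.07403 * 10  # sexM = 1, absences = 10

# Logistic transform exp(x) / (1 + exp(x)) converts log-odds to a probability
prob <- plogis(log_odds)
prob  # about 0.477
```

This matches the fitted probability the model reports for such students (see the prob_high_use column further down).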

predict with the model

# predict() the probability of high_use
probs<- predict(high_use_mod2, type = "response")

add the predicted probabilities to ‘alc’

alc$prob_high_use <- probs
#alc <- mutate(alc, prob_high_use = probs) #alternative

use the probabilities to make a prediction of high_use, setting 0.5 as threshold

alc$predict_high_use<- (alc$prob_high_use)>0.5
#alc <- mutate(alc, predict_high_use = prob_high_use > 0.5) #alternative

see the first ten and last ten original classes, predicted probabilities, and class predictions

head(alc[,c("failures", "absences", "sex", "high_use", "prob_high_use", "predict_high_use")], n=10)
##    failures absences sex high_use prob_high_use predict_high_use
## 1         0        6   F    FALSE     0.1998897            FALSE
## 2         0        4   F    FALSE     0.1772568            FALSE
## 3         3       10   F     TRUE     0.2514562            FALSE
## 4         0        2   F    FALSE     0.1566846            FALSE
## 5         0        4   F    FALSE     0.1772568            FALSE
## 6         0       10   M    FALSE     0.4771074            FALSE
## 7         0        0   M    FALSE     0.3032349            FALSE
## 8         0        6   F    FALSE     0.1998897            FALSE
## 9         0        0   M    FALSE     0.3032349            FALSE
## 10        0        0   M    FALSE     0.3032349            FALSE
select(alc, failures, absences, sex, high_use, prob_high_use, predict_high_use) %>% tail(10)
##     failures absences sex high_use prob_high_use predict_high_use
## 373        1        0   M    FALSE     0.3032349            FALSE
## 374        1       14   M     TRUE     0.5509447             TRUE
## 375        0        2   F    FALSE     0.1566846            FALSE
## 376        0        7   F    FALSE     0.2119931            FALSE
## 377        1        0   F    FALSE     0.1380993            FALSE
## 378        0        0   F    FALSE     0.1380993            FALSE
## 379        1        0   F    FALSE     0.1380993            FALSE
## 380        1        0   F    FALSE     0.1380993            FALSE
## 381        0        3   M     TRUE     0.3520937            FALSE
## 382        0        0   M     TRUE     0.3032349            FALSE

tabulate the target variable versus the predictions

table(high_use = alc$high_use, prediction = alc$predict_high_use)
##         prediction
## high_use FALSE TRUE
##    FALSE   263    7
##    TRUE     89   23

The model correctly predicted 263 cases of low use (FALSE) and 23 cases of high use (TRUE). It misclassified 89 actual high users as low users (false negatives) and 7 low users as high users (false positives).
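The usual classification metrics follow directly from these counts. A minimal sketch, with the counts hard-coded from the table above and high_use == TRUE treated as the positive class:

```r
# Sketch: classification metrics from the 2x2 confusion matrix above.
tn <- 263; fp <- 7     # observed FALSE: correctly / wrongly classified
fn <- 89;  tp <- 23    # observed TRUE:  wrongly / correctly classified

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # overall correct rate
sensitivity <- tp / (tp + fn)                   # share of high users caught
specificity <- tn / (tn + fp)                   # share of non-high users caught
round(c(accuracy, sensitivity, specificity), 3)
## [1] 0.749 0.205 0.974
```

The low sensitivity shows that, with a 0.5 threshold, the model misses most of the actual high users.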

Plotting the predicted and observed alcohol consumption

# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = prob_high_use, y = high_use, col= predict_high_use))

# define the geom as points and draw the plot
g + geom_point()

#The model makes a fair number of wrong predictions, especially for true high users

tabulate the target variable versus the predictions as proportions

conf_mat<-table(high_use = alc$high_use, prediction = alc$predict_high_use)
conf_mat<-prop.table(conf_mat)
addmargins(conf_mat)
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.68848168 0.01832461 0.70680628
##    TRUE  0.23298429 0.06020942 0.29319372
##    Sum   0.92146597 0.07853403 1.00000000
#Alternatively, this can be done as shown below:
#addmargins(prop.table(table(high_use = alc$high_use, prediction = alc$predict_high_use)))
#table(high_use = alc$high_use, prediction = alc$predict_high_use) %>%  prop.table() %>% addmargins()

mean prediction error

mean(abs(alc$high_use-alc$prob_high_use)>0.5)
## [1] 0.2513089

My model has a slightly lower mean prediction error (about 0.25) than the DataCamp model (0.26).

Below is an alternative approach, first defining a function that calculates the mean prediction error.

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$prob_high_use)
## [1] 0.2513089
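The loss function also makes it easy to check how the error depends on the classification threshold. A sketch, reusing the prob_high_use column computed earlier:

```r
# Sketch: training error at different probability thresholds.
thresholds <- c(0.3, 0.4, 0.5, 0.6)
errors <- sapply(thresholds,
                 function(t) mean((alc$prob_high_use > t) != alc$high_use))
setNames(round(errors, 3), thresholds)
```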

K-fold cross-validation

library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = high_use_mod2, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2513089
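Cross-validation could also be used to compare the two candidate models directly. A sketch (the fold assignment is random, so a seed is set for reproducibility; the exact numbers will vary):

```r
library(boot)
# Sketch: compare the two candidate models by 10-fold CV error.
set.seed(123)
cv_full    <- cv.glm(data = alc, cost = loss_func, glmfit = high_use_mod2, K = 10)
cv_reduced <- cv.glm(data = alc, cost = loss_func, glmfit = high_use_mod3, K = 10)
c(sex_and_absences = cv_full$delta[1], absences_only = cv_reduced$delta[1])
```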

This is the link to my data wrangling


Chapter 4: Clustering and classification

#Get access to the libraries
library(MASS)
library(ggplot2)
library(GGally)
library(tidyr)
#install.packages("plotly")
library(tidyverse)
library(corrplot)
library(dplyr)
library(plotly)
data("Boston")
colnames(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
#structure of the data
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
#dimension of the data
dim(Boston)
## [1] 506  14
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...

The data includes information about housing values in suburbs of Boston. The data has 14 variables and 506 observations. The relevant information in the dataset is described below:

| Variable | Definition |
|---|---|
| crim | per capita crime rate by town |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft. |
| indus | proportion of non-retail business acres per town |
| chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) |
| nox | nitrogen oxides concentration (parts per 10 million) |
| rm | average number of rooms per dwelling |
| age | proportion of owner-occupied units built prior to 1940 |
| dis | weighted mean of distances to five Boston employment centres |
| rad | index of accessibility to radial highways |
| tax | full-value property-tax rate per $10,000 |
| ptratio | pupil-teacher ratio by town |
| black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town |
| lstat | lower status of the population (percent) |
| medv | median value of owner-occupied homes in $1000s |

Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-2 points)

3. Graphical overview of the data

  • Below are the graphical representations, which show the distributions and the correlations of the variables.
#graphical overview of the data
#ggpairs(Boston)
#pairs(Boston)
#plot(Boston)

#summary of the data
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston)
cor_matrix
##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593
##            ptratio       black      lstat       medv
## crim     0.2899456 -0.38506394  0.4556215 -0.3883046
## zn      -0.3916785  0.17552032 -0.4129946  0.3604453
## indus    0.3832476 -0.35697654  0.6037997 -0.4837252
## chas    -0.1215152  0.04878848 -0.0539293  0.1752602
## nox      0.1889327 -0.38005064  0.5908789 -0.4273208
## rm      -0.3555015  0.12806864 -0.6138083  0.6953599
## age      0.2615150 -0.27353398  0.6023385 -0.3769546
## dis     -0.2324705  0.29151167 -0.4969958  0.2499287
## rad      0.4647412 -0.44441282  0.4886763 -0.3816262
## tax      0.4608530 -0.44180801  0.5439934 -0.4685359
## ptratio  1.0000000 -0.17738330  0.3740443 -0.5077867
## black   -0.1773833  1.00000000 -0.3660869  0.3334608
## lstat    0.3740443 -0.36608690  1.0000000 -0.7376627
## medv    -0.5077867  0.33346082 -0.7376627  1.0000000
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper")

From the above we can see both positive and negative correlations. For instance, industrial land use (indus) is positively correlated with nitrogen oxide concentration (nox) and with the property-tax rate (tax), and crime rate (crim) is positively correlated with the index of accessibility to radial highways (rad). On the other hand, “age” and “dis” are strongly negatively correlated, and “dis” and “tax” are moderately negatively correlated. The correlation plot reveals further weak-to-strong negative and positive correlations between the variables.
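Instead of reading pairs off the plot, the strongest correlations can also be listed programmatically. A minimal sketch, using the cor_matrix computed above:

```r
# Sketch: list variable pairs with |correlation| above 0.7.
strong <- which(abs(cor_matrix) > 0.7 & upper.tri(cor_matrix), arr.ind = TRUE)
data.frame(var1 = rownames(cor_matrix)[strong[, 1]],
           var2 = colnames(cor_matrix)[strong[, 2]],
           cor  = round(cor_matrix[strong], 2))
```

With the matrix printed above, this flags among others rad–tax (0.91), nox–dis (−0.77) and lstat–medv (−0.74).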

Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set. (0-2 points)

# MASS and Boston dataset are available

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# change the object to data frame
boston_scaled<-as.data.frame(boston_scaled)
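To answer how the variables changed: scale() subtracts each column’s mean and divides by its standard deviation, so every scaled variable now has mean 0 and standard deviation 1, as the summaries above already suggest. This can be verified directly; a minimal sketch:

```r
# Sketch: after scaling, every column has mean 0 (up to rounding error)
# and standard deviation exactly 1.
round(colMeans(boston_scaled), 10)
sapply(boston_scaled, sd)
```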


# MASS, Boston and boston_scaled are available

# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)


# boston_scaled is available

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test[,"crime"]

# remove the crime variable from test data
test <- dplyr::select(test, -crime)


# MASS and train are available

# linear discriminant analysis
lda.fit <- lda(crime~., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2400990 0.2500000 0.2524752 0.2574257 
## 
## Group means:
##                  zn      indus         chas        nox          rm
## low       0.8658208 -0.9147850 -0.109974419 -0.8669310  0.47575388
## med_low  -0.1414637 -0.2950439 -0.077423115 -0.5349873 -0.15409253
## med_high -0.3855121  0.2307070  0.306656261  0.4330212  0.05030621
## high     -0.4872402  1.0170690 -0.007331936  1.0524787 -0.42078908
##                 age        dis        rad        tax     ptratio
## low      -0.8405480  0.8198427 -0.6929785 -0.7411873 -0.41136608
## med_low  -0.3206661  0.3095613 -0.5475005 -0.4514064 -0.07001248
## med_high  0.4549900 -0.4140788 -0.4132674 -0.2980777 -0.34943789
## high      0.8052007 -0.8482281  1.6386213  1.5144083  0.78135074
##                black       lstat        medv
## low       0.38116652 -0.77208068  0.52338055
## med_low   0.32760270 -0.10548309 -0.04350642
## med_high  0.06899415  0.05303154  0.15014717
## high     -0.82323795  0.84717450 -0.66305555
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.06290891  0.59746117 -1.01572798
## indus    0.04803986 -0.39679850  0.15831952
## chas    -0.07535677 -0.09053602  0.02169491
## nox      0.35178165 -0.70437538 -1.16959082
## rm      -0.13462817 -0.06737410 -0.18847354
## age      0.21915858 -0.26929628 -0.24134428
## dis     -0.06763543 -0.15811660  0.11576159
## rad      3.19496748  0.75045155 -0.23542633
## tax     -0.02189170  0.26947885  0.72511675
## ptratio  0.09765455  0.06221391 -0.26237805
## black   -0.13537739  0.02960221  0.15122399
## lstat    0.23052124 -0.26912297  0.34110769
## medv     0.22433404 -0.34235984 -0.20350172
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9480 0.0391 0.0129
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
train$crime <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = train$crime, pch= train$crime)
lda.arrows(lda.fit, myscale = 2)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes , predicted = lda.pred$class )
##           predicted
## correct    low med_low med_high high
##   low       20       9        1    0
##   med_low    6      16        3    0
##   med_high   0      10       13    1
##   high       0       0        0   23
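The overall accuracy of the LDA predictions is the share of the diagonal in this cross-tabulation. A minimal sketch:

```r
# Sketch: overall accuracy of the LDA predictions on the test set.
conf <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(conf)) / sum(conf)
```

With the counts shown above this gives 72/102 ≈ 0.71, though the number varies between runs because the train/test split is random.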

Reload the Boston dataset and standardize the dataset (we did not do this in the Datacamp exercises, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results. (0-4 points)

# load MASS and Boston
library(MASS)
data('Boston')

boston_standard <- scale(Boston)
# euclidean distance matrix
dist_eu <-dist(boston_standard)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_standard, method="manhattan")

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
# Boston dataset is available

# k-means clustering (on the standardized data, so the distances are comparable)
km <-kmeans(boston_standard, centers = 3)

# plot the Boston dataset with clusters
pairs(Boston[6:8], col = km$cluster)

set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_standard, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
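The elbow in this curve can also be inspected numerically: the optimal number of clusters is where the total WCSS stops dropping sharply. A sketch, using the twcss vector computed above:

```r
# Sketch: relative drop in total WCSS when moving from k to k+1 clusters;
# one large drop followed by small ones suggests the elbow.
round(diff(twcss) / head(twcss, -1), 2)
```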

# k-means clustering with the chosen number of clusters
km <-kmeans(boston_standard, centers = 2)

# plot the Boston dataset with clusters
pairs(Boston[5:7], col = km$cluster)

Bonus: Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influential linear separators for the clusters? (0-2 points to compensate any loss of points from the above exercises)

boston_standard2<-scale(Boston)
set.seed(123)

# determine the number of clusters
k_max <- 20

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_standard2, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
km <-kmeans(boston_standard2, centers = 9)

#convert the scaled matrix to a data frame
boston_standard2<-as.data.frame(boston_standard2)

lda.fit_clus<- lda(km$cluster~., data=boston_standard2)

# plot the Boston dataset with clusters
pairs(boston_standard2[,3:7], col = km$cluster)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# color the observations by their k-means cluster
# (train$crime has only 404 values, so it cannot color all 506 points here)
classes <- km$cluster

# plot the lda results
plot(lda.fit_clus, dimen = 2, col = classes, pch= classes)
lda.arrows(lda.fit_clus, myscale = 2)

Super-Bonus: Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points.

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

# Next, install and access the plotly package. Create a 3D plot (Cool!) of the
# columns of the matrix product by typing the code below.

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
# Adjust the code: add color as an argument in the plot_ly() function. Set the color to be the crime classes of the train set. Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ? Are there any similarities? (0-3 points to compensate any loss of points from the above exercises)

# color by the crime classes of the train set (plotly uses 'color', not 'col')
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = as.factor(train$crime))
# color by the k-means clusters; the cluster labels of the training rows are
# picked out with the sampling index 'ind'
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = as.factor(km$cluster[ind]))